

Learning in Compact Spaces with Approximately Normalized Transformer

Franke, Jörg K. H., Spiegelhalter, Urs, Nezhurina, Marianna, Jitsev, Jenia, Hutter, Frank, Hefenbrock, Michael

arXiv.org Artificial Intelligence

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
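The motivation cited in the abstract, tight concentration of the norms of high-dimensional random vectors, is easy to verify numerically. The sketch below (not from the paper; dimensions and sample counts are illustrative) shows that the norm of a d-dimensional vector with i.i.d. standard normal entries clusters tightly around sqrt(d), which is what makes a single scalar multiplication a good approximation to exact normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                       # illustrative hidden dimension
x = rng.standard_normal((1000, d))
norms = np.linalg.norm(x, axis=1)

# norms concentrate around sqrt(d): mean ratio near 1,
# relative spread on the order of 1/sqrt(2d)
print(norms.mean() / np.sqrt(d))
print(norms.std() / norms.mean())
```

The relative spread shrinks as the dimension grows, so at transformer-scale widths dividing by the constant sqrt(d) is nearly as accurate as computing each vector's norm exactly.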


Online Hybrid-Belief POMDP with Coupled Semantic-Geometric Models and Semantic Safety Awareness

Lemberg, Tuvy, Indelman, Vadim

arXiv.org Artificial Intelligence

Robots operating in complex and unknown environments frequently require geometric-semantic representations of the environment to safely perform their tasks. While inferring the environment, they must account for many possible scenarios when planning future actions. Since objects' class types are discrete and the robot's self-pose and the objects' poses are continuous, the environment can be represented by a hybrid discrete-continuous belief which is updated according to models and incoming data. Prior probabilities and observation models representing the environment can be learned from data using deep learning algorithms. Such models often couple environmental semantic and geometric properties. As a result, semantic variables are interconnected, causing the semantic state-space dimensionality to increase exponentially. In this paper, we consider planning under uncertainty using partially observable Markov decision processes (POMDPs) with hybrid semantic-geometric beliefs. The models and priors account for the coupling between semantic and geometric variables. Within the POMDP, we introduce the concept of semantically aware safety. Obtaining representative samples of the theoretical hybrid belief, required for estimating the value function, is very challenging. As a key contribution, we develop a novel form of the hybrid belief and leverage it to draw representative samples. We show that under certain conditions, the value function and probability of safety can be calculated efficiently with an explicit expectation over all possible semantic mappings. Our simulations show that our estimates of the objective function and probability of safety achieve similar levels of accuracy compared to estimators that run exhaustively on the entire semantic state-space using samples from the theoretical hybrid belief. Nevertheless, the complexity of our estimators is polynomial rather than exponential.


Calo-VQ: Vector-Quantized Two-Stage Generative Model in Calorimeter Simulation

Liu, Qibin, Shimmin, Chase, Liu, Xiulong, Shlizerman, Eli, Li, Shu, Hsu, Shih-Chieh

arXiv.org Artificial Intelligence

We introduce a novel machine learning method developed for the fast simulation of calorimeter detector response, adapting the vector-quantized variational autoencoder (VQ-VAE). Our model adopts a two-stage generation strategy: it first compresses geometry-aware calorimeter data into a discrete latent space, then applies a sequence model to learn and generate the latent tokens. Extensive experimentation on the Calo-challenge dataset underscores the efficiency of our approach, showing a generation-speed improvement over the conventional method by a factor of 2000. Remarkably, our model generates calorimeter showers within milliseconds. Furthermore, comprehensive quantitative evaluations across various metrics validate the physics performance of the generated showers.
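The first stage of such a pipeline rests on vector quantization: each continuous latent vector is replaced by the index of its nearest codebook entry, turning the encoder output into discrete tokens a sequence model can learn. A minimal sketch of that lookup (codebook size, dimensions, and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 16, 4                          # illustrative codebook size and latent dim
codebook = rng.standard_normal((K, d))

def quantize(z):
    """Map each latent vector to its nearest codebook entry (L2 distance)."""
    dists = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    tokens = dists.argmin(axis=1)     # discrete indices for the sequence model
    return tokens, codebook[tokens]   # indices and quantized vectors

z = rng.standard_normal((8, d))       # stand-in for encoder output
tokens, z_q = quantize(z)
```

The `tokens` array is what the second-stage sequence model is trained on; decoding a generated token sequence back through the codebook and decoder yields a synthetic shower.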


Quantum linear algebra is all you need for Transformer architectures

Guo, Naixu, Yu, Zhan, Choi, Matthew, Agrawal, Aman, Nakaji, Kouhei, Aspuru-Guzik, Alán, Rebentrost, Patrick

arXiv.org Artificial Intelligence

Generative machine learning methods such as large language models are revolutionizing the creation of text and images. While these models are powerful, they also consume a large amount of computational resources. The transformer is a key component in large language models that aims to generate a suitable completion of a given partial sequence. In this work, we investigate transformer architectures under the lens of fault-tolerant quantum computing. The input model is one where trained weight matrices are given as block encodings and we construct the query, key, and value matrices for the transformer. We show how to prepare a block encoding of the self-attention matrix, with a new subroutine for the row-wise application of the softmax function. In addition, we combine quantum subroutines to construct important building blocks in the transformer: the residual connection and layer normalization, and the feed-forward neural network. Our subroutines prepare an amplitude encoding of the transformer output, which can be measured to obtain a prediction. Based on common open-source large language models, we provide insights into the behavior of important parameters determining the run time of the quantum algorithm. We discuss the potential and challenges for obtaining a quantum advantage.


State Derivative Normalization for Continuous-Time Deep Neural Networks

Weigand, Jonas, Beintema, Gerben I., Ulmen, Jonas, Görges, Daniel, Tóth, Roland, Schoukens, Maarten, Ruskowski, Martin

arXiv.org Artificial Intelligence

The importance of proper data normalization for deep neural networks is well known. However, in continuous-time state-space model estimation, it has been observed that improper normalization of the hidden state, the hidden state derivative, or even the time interval can lead to numerical and optimization challenges in deep-learning-based methods, resulting in reduced model quality. In this contribution, we show that these three normalization tasks are inherently coupled. Due to this coupling, we propose a solution to all three normalization challenges by introducing a normalization constant at the state derivative level. We show that the appropriate choice of the normalization constant is related to the dynamics of the to-be-identified system, and we derive multiple methods for obtaining an effective normalization constant. We compare and discuss all the normalization strategies on a benchmark problem based on experimental data from a cascaded tanks system and compare our results with other methods from the identification literature.
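The effect of a state-derivative normalization constant can be illustrated on a toy linear system; the constant `tau` below is a hypothetical time constant, not a value from the paper. Rescaling time by `tau` turns a derivative of magnitude 1/tau into one of magnitude O(1), which is the better-conditioned regression target for a network:

```python
import numpy as np

tau = 1e-3  # hypothetical time constant of a fast system

def f(x):
    # original dynamics dx/dt = -x / tau: derivatives of size ~1/tau
    return -x / tau

def f_normalized(x):
    # rescaled time t' = t / tau gives dx/dt' = tau * f(x) = -x,
    # so the state derivative is O(1) and well scaled for learning
    return tau * f(x)

x = np.array([1.0])
print(abs(f(x)[0]))             # ~1000: poorly scaled target
print(abs(f_normalized(x)[0]))  # ~1: well-scaled target
```

The same rescaling simultaneously normalizes the state, its derivative, and the effective time interval, which is the coupling the abstract refers to.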


Automatic Hyperparameter Tuning in Sparse Matrix Factorization

Kawasumi, Ryota, Takeda, Koujin

arXiv.org Artificial Intelligence

Among machine learning problems, matrix factorization (MF) is significant because it appears in many applications such as recommender systems and signal processing. In this article, we restrict ourselves to the sparse MF problem, where one of the factorized matrices must be sparse. This problem was originally discussed as sparse coding in neuroscience [1, 2] and is recognized as a significant problem in neuronal information processing in the brain. It also appears in sparse modeling in information science, such as dictionary learning [3, 4] and sparse principal component analysis (sparse PCA) [5, 6]. Many attempts have been made to understand the theoretical aspects of MF, and analytical tools for random systems in statistical physics have proven useful, e.g., the Markov chain Monte Carlo method [7], replica analysis [8, 9, 10, 11, 12], and message passing [9, 10, 11, 12, 13, 14], where some works are not limited to the sparse matrix case.


A Parametric Similarity Method: Comparative Experiments based on Semantically Annotated Large Datasets

De Nicola, Antonio, Formica, Anna, Missikoff, Michele, Pourabbas, Elaheh, Taglino, Francesco

arXiv.org Artificial Intelligence

We present the parametric method SemSimp, aimed at measuring the semantic similarity of digital resources. SemSimp is based on the notion of information content, and it leverages a reference ontology and taxonomic reasoning, encompassing different approaches for weighting the concepts of the ontology. In particular, weights can be computed by considering either the available digital resources or the structure of the reference ontology of a given domain. SemSimp is assessed against six representative semantic similarity methods for comparing sets of concepts proposed in the literature, through an experimental evaluation that includes both a statistical analysis and an expert judgement evaluation. To achieve a reliable assessment, we used a real-world large dataset based on the Digital Library of the Association for Computing Machinery (ACM) and a reference ontology derived from the ACM Computing Classification System (ACM-CCS). For each method, we considered two indicators: the first concerns the degree of confidence in identifying the similarity among papers belonging to selected special issues of the ACM Transactions on Information Systems journal; the second concerns the Pearson correlation with human judgement. The results reveal that one of the configurations of SemSimp outperforms the other assessed methods. An additional experiment performed in the domain of physics shows that, in general, SemSimp provides better results than the other similarity methods.


Hybrid Belief Pruning with Guarantees for Viewpoint-Dependent Semantic SLAM

Lemberg, Tuvy, Indelman, Vadim

arXiv.org Artificial Intelligence

Semantic simultaneous localization and mapping is a subject of increasing interest in robotics and AI that directly influences the autonomous vehicle industry, the defense industry, and more. One of the challenges in this field is to obtain object classification jointly with robot trajectory estimation. With viewpoint-dependent semantic measurements, there is a coupling between different classes, resulting in a combinatorial number of hypotheses. A common solution is to prune hypotheses that have a sufficiently low probability and retain only a limited number of hypotheses. However, after pruning and renormalization, the updated probability is overconfident with respect to the original probability. This is especially problematic for systems that require high accuracy. If the prior probabilities of the classes are independent, the original normalization factor can be computed efficiently without pruning hypotheses. To the best of our knowledge, this is the first work to present these results. If the prior probabilities of the classes are dependent, we propose a lower bound on the normalization factor that ensures cautious results. The bound is calculated incrementally and with similar efficiency as in the independent case. After pruning and updating based on the bound, this belief is shown empirically to be close to the original belief.


Dot Product is All You Need

#artificialintelligence

Most of us have heard about this concept, whether through deliberate study of books, papers, and videos or in passing from friends who talk a lot about mathematics. It is typically introduced in basic linear algebra courses. Suppose you have two vectors a = [1, 2, 4] and b = [2, 5, 8]. By applying the dot product to these vectors you get a scalar value of 44. The dot product itself is defined by the following convention.
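The arithmetic above is easy to verify in a few lines of code (the helper name `dot` is mine, not from the post):

```python
def dot(a, b):
    # elementwise products, summed: a·b = sum of a_i * b_i
    assert len(a) == len(b), "vectors must have the same length"
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 4]
b = [2, 5, 8]
print(dot(a, b))  # 1*2 + 2*5 + 4*8 = 44
```

For orthogonal vectors such as [1, 0] and [0, 1] the same function returns 0, which is the geometric content of the dot product: it measures how much two vectors point in the same direction.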


Learning Continuous Exponential Families Beyond Gaussian

Ren, Christopher X., Misra, Sidhant, Vuffray, Marc, Lokhov, Andrey Y.

arXiv.org Machine Learning

We address the problem of learning continuous exponential family distributions with unbounded support. While a lot of progress has been made on learning Gaussian graphical models, we still lack scalable algorithms for reconstructing general continuous exponential families that model higher-order moments of the data beyond the mean and the covariance. Here, we introduce a computationally efficient method for learning continuous graphical models based on the Interaction Screening approach. Through a series of numerical experiments, we show that our estimator achieves similar accuracy and sample complexity to alternative approaches such as maximization of the conditional likelihood, while considerably improving upon their run-time.